Templatized Transformation using Databricks
Data transformation is the process of converting, cleansing, and structuring data into a usable format that can be analyzed to support decision-making processes. Raw data is converted into a usable format by removing duplicates, converting data types, and enriching the dataset. The process involves defining the structure, mapping the data, and extracting the data from the source system.
Data Pipeline Studio (DPS) provides templates for creating transformation jobs. These jobs support join, union, and aggregate operations that group or combine data for analysis.
For complex operations on data, DPS provides the option of creating custom transformation jobs. In custom jobs the transformation logic is written by the users, while the DPS UI provides an option to create SQL queries by selecting specific columns of tables. Calibo Accelerate consumes these SQL queries, along with the transformation logic, to generate the code for custom transformation jobs.
To create a Databricks templatized transformation job
- Sign in to the Calibo Accelerate platform and navigate to Products.
- Select a product and feature. Click the Develop stage of the feature and navigate to Data Pipeline Studio.
- Create a pipeline with the following nodes:
  Data Lake (Amazon S3) > Data Transformation (Databricks) > Data Lake (Amazon S3)
  In the data transformation pipeline that you create, you can either add two data lake nodes, or add a single data lake node and connect the data transformation node to and from the same node.
- Click the Databricks node and click Create Templatized Job.
  Complete the following steps to create the job:

Provide job details for the data transformation job:
- Template - Based on the source and destination that you choose in the data pipeline, the template is automatically selected.
- Job Name - Provide a name for the data transformation job.
- Node Rerun Attempts - The number of times a rerun is attempted on this node in case of pipeline failure. The default is set at the pipeline level.
- Fault Tolerance - Define the behavior of the node upon failure: the descendant nodes can either stop and skip execution or continue their normal operation. The available options are listed below (a sketch of their effect follows this list):
  - Default - If a node fails, the subsequent nodes go into pending state.
  - Proceed on Failure - If a node fails, the subsequent nodes are executed.
  - Skip on Failure - If a node fails, the subsequent nodes are skipped from execution.
  For more information, see Fault Tolerance of Data Pipelines.
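The following Python sketch is purely illustrative (it is not Calibo Accelerate code) and shows how the three fault tolerance settings could determine the state of a downstream node after an upstream failure. The names and states are assumptions made for this illustration.

```python
# Illustration only: how the three fault tolerance settings could map to
# downstream node states. Names and logic are hypothetical, not Calibo code.
from enum import Enum


class FaultTolerance(Enum):
    DEFAULT = "default"              # downstream nodes wait in pending state
    PROCEED_ON_FAILURE = "proceed"   # downstream nodes still run
    SKIP_ON_FAILURE = "skip"         # downstream nodes are skipped


def downstream_state(upstream_failed: bool, policy: FaultTolerance) -> str:
    """Return the state a descendant node would take after an upstream failure."""
    if not upstream_failed:
        return "run"
    if policy is FaultTolerance.PROCEED_ON_FAILURE:
        return "run"
    if policy is FaultTolerance.SKIP_ON_FAILURE:
        return "skipped"
    return "pending"  # DEFAULT


print(downstream_state(True, FaultTolerance.SKIP_ON_FAILURE))  # skipped
```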

Review the configuration details of the source node:
- Source - The selected source is displayed.
- Datastore - The selected datastore is displayed.
- Source Format - The source format Parquet is preselected. Currently Calibo Accelerate supports the Parquet and Delta Table formats for Amazon S3.
- Choose Base Path - To select the required folder, click Add Base Path. If you select one folder, you can perform aggregation. To perform the join and union operations, you must select more than one folder.
  The source data is picked up for transformation from the selected path (see the read sketch after these steps).
- Click Next.
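The sketch below shows roughly what this source configuration corresponds to in PySpark on Databricks. The bucket and folder names are placeholders, not values taken from the platform, and the platform generates its own code.

```python
# Hypothetical sketch of the source step: read the selected base paths from
# Amazon S3 in one of the supported formats (Parquet or Delta Table).
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # Databricks provides `spark` automatically

# Parquet source (the preselected format)
doctor_df = spark.read.parquet("s3://my-datalake/base/doctor/")
hospital_df = spark.read.parquet("s3://my-datalake/base/hospital/")

# Delta Table source: same path, different format
# hospital_df = spark.read.format("delta").load("s3://my-datalake/base/hospital/")
```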

Select the operation to perform for the transformation job from the following options:

Join
This operation joins or combines the data from the selected files. You can perform different types of join operations depending on your use case. Provide the following information for the join operation:
- Join Name - By default, a target file is created with the following naming convention: File1-File2-JOIN. You can rename the file as per your requirement.
- Partition/File - Select a file from the two files that you want to use for the join operation. Say you select Doctor.
- Alias Name - You can provide an alternate name for the file. By default, the file name is used.
- Define Joins - Select the appropriate options for the type of join you want to perform, as shown in the following examples.
| Option | Selection |
|---|---|
| Join Type | Inner Join - Use this option when you want to combine data that exists in both the tables. |
| Partition/File 2 | Hospital |
| Alias Name | Hospital |
| On | Select the column with unique values in the first table. In this case it is Hospital.HOSPITAL_CODE. |
| Join Operator | Select the operator. In this case it is =. |
| Field | Select the column with unique values in the second table. In this case it is Doctor.HOSPITAL_CODE. |

| Option | Selection |
|---|---|
| Join Type | Left Outer Join - Use this option when you want to retain all the records from the left table (which is the main table) and add relevant records from the right table. |
| Partition/File 2 | Hospital |
| Alias Name | Hospital |
| On | Select the column with unique values in the first table. In this case it is Hospital.HOSPITAL_CODE. |
| Join Operator | Select the operator. In this case it is =. |
| Field | Select the column with unique values in the second table. In this case it is Doctor.HOSPITAL_CODE. |

| Option | Selection |
|---|---|
| Join Type | Right Outer Join - Use this option when you want to retain all the records from the right table (which is the main table) and add relevant records from the left table. |
| Partition/File 2 | Hospital |
| Alias Name | Hospital |
| On | Select the column with unique values in the first table. In this case it is Hospital.HOSPITAL_CODE. |
| Join Operator | Select the operator. In this case it is =. |
| Field | Select the column with unique values in the second table. In this case it is Doctor.HOSPITAL_CODE. |
- Add Filter - You can also add a filter to the join operation. This filters the matching records from the data combined after performing the join operation (a PySpark sketch of these join and filter options follows).
- Click Add.
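The join selections above roughly translate to the PySpark below, continuing the doctor_df and hospital_df sketch from the source step. The platform generates its own code; the HOSPITAL_TYPE filter column is a hypothetical example.

```python
# Rough PySpark equivalent of the join settings above (illustrative only).
doctor = doctor_df.alias("Doctor")
hospital = hospital_df.alias("Hospital")

on = doctor["HOSPITAL_CODE"] == hospital["HOSPITAL_CODE"]

inner_join = doctor.join(hospital, on, "inner")        # rows that exist in both files
left_join = doctor.join(hospital, on, "left_outer")    # keep all Doctor rows
right_join = doctor.join(hospital, on, "right_outer")  # keep all Hospital rows

# Add Filter: filter the matching records after the join
filtered = inner_join.filter(hospital["HOSPITAL_TYPE"] == "CLINIC")
```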

Union
This operation combines the data from both the tables into a single dataset.

Note:
To perform this operation, both the tables must have the same schema.
- Output File Name - By default, a target file is created with the following naming convention: File1-File2-UNION. You can rename the file as per your requirement.
- Partition/File 1 - Select the first file to perform the union operation.
- Partition/File 2 - Select the second file to perform the union operation.
- Click Add.
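Continuing the hypothetical PySpark sketches above, a union of two files that share the same schema might look like the following; the file paths are placeholders.

```python
# Hypothetical union of two same-schema files. unionByName matches columns by
# name rather than by position; both inputs are placeholders.
appointments_2023 = spark.read.parquet("s3://my-datalake/base/appointments-2023/")
appointments_2024 = spark.read.parquet("s3://my-datalake/base/appointments-2024/")

union_df = appointments_2023.unionByName(appointments_2024)
```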

Aggregation
This operation uses aggregate functions to summarize data from a table or file and provide a meaningful summary. You can select functions such as Max, Mean, Min, and Average, to name a few.
- Partition/File - Select a file to perform the aggregation operation.
- Function - Select the function to use for the aggregate operation.
- Column - Select the column on which the function is used.
- Group By - Select the column on which you want the summarized data to be grouped.
- Output File Name - By default, a target file is created with the following naming convention: File name-Aggregation. You can rename the file as per your requirement.
  The query is generated based on the selected options (see the aggregation sketch after these steps).
- Click Add.
- Click Next.
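A hypothetical PySpark equivalent of an aggregation configured in this step, continuing the doctor_df sketch and assuming a CONSULTATION_FEE column that is not part of the original example, could look like this:

```python
# Illustrative aggregation: group the Doctor file by HOSPITAL_CODE and
# summarize a hypothetical CONSULTATION_FEE column with Max, Min, and Average.
from pyspark.sql import functions as F

aggregated = (
    doctor_df.groupBy("HOSPITAL_CODE")
    .agg(
        F.max("CONSULTATION_FEE").alias("MAX_FEE"),
        F.min("CONSULTATION_FEE").alias("MIN_FEE"),
        F.avg("CONSULTATION_FEE").alias("AVG_FEE"),
    )
)
```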

In this step, you map the source columns to the target columns (a sketch of this step follows the list below).
- Filter columns from selected tables - Deselect the columns that you do not want to include in the transformation query, as per your use case.
- Add Custom Columns - Enable this option to add columns apart from the existing columns of the table. To add a custom column, provide the following details and then click Add Custom Column. Repeat the steps for each column that you want to add.
  - Column Name - Provide a name for the custom column that you want to add.
  - Type and Value - Select the parameter type for the new column. Choose from the following options:
    - Static Parameter - Provide a static value that is added for this column.
    - System Parameter - Select a system-generated parameter from the dropdown list that must be added to the custom column.
    - Generated - Provide the SQL code to combine two or more columns to generate the value of the new column.
  You can view the Added Custom Columns. You can update a custom column by clicking the pencil icon or delete it.
- Click Next.
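Continuing the earlier join sketch, the mapping step roughly corresponds to the PySpark below. The selected columns, custom column names, and expressions are assumptions for illustration, not platform defaults.

```python
# Sketch of the mapping step: keep only the selected columns and add one
# custom column of each type. Names and expressions are placeholders.
from pyspark.sql import functions as F

mapped = (
    inner_join
    .select("Doctor.DOCTOR_ID", "Doctor.DOCTOR_NAME", "Hospital.HOSPITAL_NAME")
    # Static Parameter: the same fixed value for every row
    .withColumn("SOURCE_SYSTEM", F.lit("DPS"))
    # System Parameter: a system-generated value such as the load timestamp
    .withColumn("LOAD_TS", F.current_timestamp())
    # Generated: SQL that combines two or more existing columns
    .withColumn("DOCTOR_LABEL", F.expr("concat(DOCTOR_NAME, ' - ', HOSPITAL_NAME)"))
)
```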

Configure the target node (a write sketch follows these steps).
- Target - AWS S3 is auto-populated based on the pipeline you create.
- Datastore - The S3 datastore is already selected.
- Target Format - Select one of the following target formats:
  - Parquet - Select this option if you want to use the Parquet format for the target data.
  - Delta Table - Select this option if you want to create a table with delta data.
- Base Target Folder - Select a folder on the target S3 bucket.
- Subfolder - Provide a folder name that you want to create inside the Base Target Folder. This is optional.
- Output Data Folder - Provide a folder name in which the output of the transformation job is stored.
- Operation Type - Choose the type of operation that you want to perform on the data files from the following options:
  - Append - Add new data to the existing data.
  - Overwrite - Replace the old data with new data.
- Enable Partitioning - Enable this option if you want to use partitioning for the target data. Select from the following options:
  - Data Partition - Select the file name and column details, and enter the column value. Click Add.
  - Date Based Partitioning - Select the type of partitioning that you want to use for the target data from the options Yearly, Monthly, and Daily. Optionally, add a prefix to the partition folder name.
- Complete Output Path - Review the final path of the target file. This is based on the inputs that you provide.
- Audit Log Path - Displays the path where the audit logs for the job are stored.
- Click Next.
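The write sketch below continues the mapped example and shows, in hedged form, what the target settings amount to in PySpark. The bucket, folders, and partition columns are placeholders.

```python
# Hypothetical target step: write the result to S3 as Parquet (or Delta),
# append or overwrite, with monthly date-based partitioning.
from pyspark.sql import functions as F

output_path = "s3://my-datalake/base-target/subfolder/output-data/"

(
    mapped
    .withColumn("year", F.year("LOAD_TS"))
    .withColumn("month", F.month("LOAD_TS"))
    .write.mode("append")            # Operation Type: Append (use "overwrite" to replace)
    .partitionBy("year", "month")    # Date Based Partitioning: Monthly
    .parquet(output_path)            # Target Format: Parquet
)

# Target Format: Delta Table
# mapped.write.format("delta").mode("append").save(output_path)
```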

Enable this option if you want to publish the metadata related to the data to the AWS Glue metastore on S3:
- Metadata Store - Currently DPS supports AWS Glue.
- Select a configured AWS Glue Catalog from the dropdown list. See Configuring AWS Glue.
- Database - The database name is populated based on the selection.
- Data Location - The location is created based on the selected S3 datastore and database.
- Select Entity - Select an entity from the dropdown list.
- Glue Table - Either select an existing Glue table to which the metadata gets added, or create a new Glue table to add the metadata (a sketch of registering the output in Glue follows these steps).
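As a hedged sketch, publishing the metadata amounts to registering the output location as an external table in a Glue database when the Glue Data Catalog is configured as the metastore. The database, table, and path names below are placeholders; the platform performs this registration for you.

```python
# Hedged sketch: register the output folder as an external table in an AWS
# Glue database, assuming the cluster uses the Glue Data Catalog as metastore.
spark.sql("""
    CREATE TABLE IF NOT EXISTS my_glue_db.doctor_hospital_output
    USING PARQUET
    LOCATION 's3://my-datalake/base-target/subfolder/output-data/'
""")
```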

To run the configured job, select the type of Databricks cluster from the following options:

Cluster - Select the all-purpose cluster from the dropdown list that you want to use for the data transformation job.

Note:
If you do not see a cluster configuration in the dropdown list, it is possible that the configured Databricks cluster has been deleted.
In this case, you must create a new Databricks cluster configuration in the Data Integration and Data Transformation section of Cloud Platform Tools and Technologies. Delete the Databricks data transformation node from the data pipeline, add a new node with the newly created configuration, and configure the job again. You can then select the newly configured Databricks cluster.
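Related to the note above, one way to confirm which all-purpose clusters still exist in the workspace is to query the Databricks REST API (GET /api/2.0/clusters/list). The workspace URL and token below are placeholders.

```python
# Hedged sketch: list the clusters in a Databricks workspace to verify that
# the configured cluster still exists. Placeholders only; not Calibo code.
import requests

host = "https://<your-workspace>.cloud.databricks.com"
token = "<personal-access-token>"

response = requests.get(
    f"{host}/api/2.0/clusters/list",
    headers={"Authorization": f"Bearer {token}"},
    timeout=30,
)
response.raise_for_status()

for cluster in response.json().get("clusters", []):
    print(cluster["cluster_id"], cluster["cluster_name"])
```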
What's next? Databricks Custom Transformation Job